Consensus pattern alignment to find protein-protein interactions in text

Authors

  • Jörg Hakenberg
  • Michael Schroeder
  • Ulf Leser
Abstract

“Don’t I know you from somewhere?” Comparing new texts to known ones plays a key role in the system we propose for finding protein–protein interactions (PPIs). Our system builds on an inexact pattern matching strategy, in which patterns (linguistic frames) reflect the compositional structure of known occurrences of PPIs in text. This structure is described using part-of-speech tags (verbs etc.), entity classes (proteins), words, and word stems. Consider the sentences “Sky1p phosphorylates Npl3p” and “Akt phosphorylates beta-catenin”: both share a structure that connects two proteins with a single verb. Comparable systems proposed before [1, 2] made clear that collecting a suitable set of patterns is of major importance, and this step forms the main component of our system. From the IntAct database [5], we extract all pairs of proteins known to interact. We scan PubMed for textual evidence of each such interaction and retain all single sentences that describe it. Using pairwise sentence alignment as a similarity scoring function, we cluster the resulting set of sentences. Within each cluster, multiple sentence alignment (MSA) identifies commonalities and variable positions across all sentences, expressed as a consensus pattern. Figure 1 shows an example MSA with four sentences that define one consensus pattern. We can then align such consensus patterns against arbitrary text to extract new PPIs. On the BioCreative test set, our system yields a maximum recall of 69% (the best reported among all participating systems), a maximum precision of 45%, and a maximum F1-measure of 41%. Our method works completely independently of the training corpus, which we did not use at any stage. Thus, we intrinsically exclude any risk of overfitting, and we believe that our approach should work equally well for related extraction problems, such as finding protein–disease associations.
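The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain Needleman-Wunsch global alignment over (word, tag) tokens, and the scoring values, the tag names (`PROTEIN`, `VERB`, `PREP`), the wildcard symbols `*` and `*?`, and the function names `token_score`, `align`, and `consensus` are all assumptions made for illustration. It also derives a pattern from only two sentences, whereas the paper builds consensus patterns from a multiple sentence alignment over a whole cluster.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, tag); tag is a POS tag or an entity class
Column = Tuple[Optional[Token], Optional[Token]]

GAP = -1  # assumed gap penalty, chosen for illustration only


def token_score(a: Token, b: Token) -> int:
    """Assumed scoring: same word = 2, same tag only = 1, otherwise -1."""
    if a[0].lower() == b[0].lower():
        return 2
    if a[1] == b[1]:
        return 1
    return -1


def align(s: List[Token], t: List[Token]) -> Tuple[int, List[Column]]:
    """Global (Needleman-Wunsch) alignment of two tagged sentences."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + token_score(s[i - 1], t[j - 1]),
                dp[i - 1][j] + GAP,
                dp[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right cell to recover the aligned columns.
    cols: List[Column] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + token_score(s[i - 1], t[j - 1]):
            cols.append((s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + GAP:
            cols.append((s[i - 1], None))
            i -= 1
        else:
            cols.append((None, t[j - 1]))
            j -= 1
    cols.reverse()
    return dp[n][m], cols


def consensus(cols: List[Column]) -> List[str]:
    """Keep shared words, fall back to shared tags, else emit a wildcard."""
    pattern = []
    for a, b in cols:
        if a is None or b is None:
            pattern.append("*?")  # position present in only one sentence
        elif a[0].lower() == b[0].lower():
            pattern.append(a[0])
        elif a[1] == b[1]:
            pattern.append(a[1])
        else:
            pattern.append("*")
    return pattern


s1 = [("Sky1p", "PROTEIN"), ("phosphorylates", "VERB"), ("Npl3p", "PROTEIN")]
s2 = [("Akt", "PROTEIN"), ("phosphorylates", "VERB"), ("beta-catenin", "PROTEIN")]
score, cols = align(s1, s2)
print(score, consensus(cols))  # 4 ['PROTEIN', 'phosphorylates', 'PROTEIN']
```

In this sketch the alignment score doubles as the pairwise similarity that a clustering step could consume, and the consensus list plays the role of a pattern that could in turn be aligned against unseen sentences.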

Similar resources

Bioinformatics Analysis of Upstream Region and Protein Structure of Fungal Phytase Gene

Phytase increases the bioavailability of phytate phosphorus in seed-based animal feeds and reduces the phosphorus pollution of animal waste. Since most animal feeds for pellets are heated up to 65-80 °C, the production of a thermostable structure for phytase can be useful. In this study, we sought to perform bioinformatics analysis of the upstream region and protein structure of fungal phytase ...


P-127: Characterization of Filia, A Maternal Effect Gene, in Bovine Oocytes and Embryos

Background: Genetic analysis in mice has led to the discovery of maternal effect genes such as Filia. Filia knockout mice have a 50% decrease in fertility. Filia dysfunction causes disorders in pre-implantation development. Mutations in the human Filia gene cause FBHM (Familial Biparental Hydatidiform Mole) in women. Filia protein in mice is homologous to that of rat and human, so this idea has emerge...


Mining relations from the biomedical literature

Text mining deals with the automated annotation of texts and the extraction of facts from textual data for subsequent analysis. Such texts range from short articles and abstracts to large documents, for instance web pages and scientific articles, but also include textual descriptions in otherwise structured databases. This thesis focuses on two key problems in biomedical text mining: relationsh...


Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks

Background: Prediction of protein localization is among the most important issues in bioinformatics and is used to predict the location of proteins in cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied to predict intracellular protein locations. These algorithms use the features extracted from pro...


Dengue virus type-3 envelope protein domain III; expression and immunogenicity

Objective(s): Production of a recombinant, immunogenic antigen from the dengue virus type-3 envelope protein is a key step in dengue vaccine development and diagnostic research. The goals of this study were to produce a recombinant protein from the dengue virus type-3 envelope protein and to evaluate its immunogenicity in mice. Materials and Methods: Multiple amino acid sequences of different i...


Publication date: 2007